We try to approximate \(f\) by a function \(\hat{f}:\mathbb{R}\rightarrow\mathbb{R}\), \(x \mapsto wx + b\), where \(w, b \in \mathbb{R}\). How do we select \(w\) and \(b\) so that \(\hat{f}\) approximates the data well?
Question: How do we find \(w, b \in \mathbb{R}\) such that \(\hat{f}(x) \approx f(x)\)? Equivalently: find \(w, b\) such that \(\hat{f}(x_i) \approx y_i\) for all \(i\).
\[L(w, b) = \frac{1}{2} \sum_{i=1}^{N}|wx_i + b - y_i|^2\] Each residual \(wx_i + b - y_i\) is the vertical distance from the data point \((x_i, y_i)\) to the estimated line. An exact fit (all residuals zero) is only possible if the data lie exactly on a line, so in general we look for the \(w, b\) which minimize \(L(w, b)\). You may be wondering why there is a \(\frac{1}{2}\) in the loss function. Scaling a loss function by a constant does not change the location of its minimum (the optimal \(w\) and \(b\) remain the same whether the \(\frac{1}{2}\) is included or not), and it simplifies the derivatives: the factor of \(2\) coming from the square cancels the \(\frac{1}{2}\), leaving a cleaner expression.
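As a quick illustration, here is a minimal Python/NumPy sketch that evaluates this loss directly from its definition (the data values and names below are made up for the example, not taken from the notes):

```python
import numpy as np

# Toy data, made up for this illustration
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

def loss(w, b):
    """L(w, b) = 1/2 * sum_i (w*x_i + b - y_i)^2."""
    residuals = w * x + b - y
    return 0.5 * np.sum(residuals ** 2)

print(loss(1.0, 0.0))  # loss of the line f(x) = x on the toy data
```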
First observation: \(L\) is differentiable. Second observation: \(L\) is convex. Therefore \((w, b)\) is a minimum of \(L\) \(\Leftrightarrow\) \(\partial_w L(w, b) = \partial_b L(w, b) = 0\), which is the same as \(\nabla L(w, b) = 0\).
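One way to justify the convexity claim (a short verification that is not spelled out here, using the partial derivatives computed in the next step): the Hessian of \(L\) is constant, \[\nabla^2 L(w, b) = \begin{pmatrix} \sum_{i=1}^N x_i^2 & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & N \end{pmatrix},\] and for any direction \((v_1, v_2)\) \[v_1^2 \sum_{i=1}^N x_i^2 + 2 v_1 v_2 \sum_{i=1}^N x_i + v_2^2 N = \sum_{i=1}^N (v_1 x_i + v_2)^2 \ge 0,\] so the Hessian is positive semidefinite and \(L\) is indeed convex.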
\[\partial_w L(w, b) = \partial_w\!\left[\frac{1}{2} \sum_{i=1}^{N}(wx_i + b-y_i)^2\right]\] Apply the chain rule \(f'(g(w))g'(w)\) with \(f(u) = u^2\) and \(g(w) = wx_i + b - y_i\): \[\partial_w (wx_i + b - y_i)^2 = 2(wx_i + b - y_i) \cdot \frac{\partial}{\partial w}(wx_i + b - y_i) = 2(wx_i + b - y_i) \cdot x_i\] so \[\partial_w L(w, b) = \frac{1}{2} \sum_{i=1}^N 2 (wx_i+b-y_i)\cdot x_i = \sum_{i = 1}^{N} x_i(wx_i + b - y_i)\] Now expanding \(x_i(wx_i + b - y_i) = wx_i^2 + bx_i - x_iy_i\) gives \[\partial_w L(w, b) = w \sum_{i=1}^{N} x_i^2 + b \sum_{i = 1}^{N} x_i - \sum_{i = 1}^{N} x_iy_i\]
Similarly, applying the chain rule for the derivative with respect to \(b\) (note that \(\frac{\partial}{\partial b}(wx_i + b - y_i) = 1\)): \[\partial_b L(w,b) = \frac{1}{2} \sum_{i=1}^N 2(wx_i + b - y_i) \cdot \frac{\partial}{\partial b} (wx_i + b -y_i) = \sum_{i=1}^N (wx_i + b - y_i)\] \[ = w\sum_{i=1}^N x_i + b \sum_{i=1}^N 1 - \sum_{i=1}^N y_i \] Since \(b\sum_{i=1}^N 1 = b \cdot N\), this gives \[\partial_b L(w, b) = w \sum_{i = 1}^{N} x_i + N b - \sum_{i = 1}^{N} y_i\] The necessary and sufficient optimality conditions for \[\min_{w, b \in \mathbb{R}} \mathcal{L}(w, b)\] are given by \(\frac{\partial}{\partial w}\mathcal{L}(w, b) = 0\) and \(\frac{\partial}{\partial b}\mathcal{L}(w, b) = 0\) and are equivalent to the linear system
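As a sanity check (a small Python/NumPy sketch with made-up toy data, not part of the notes), the two derivative formulas can be compared against central finite differences:

```python
import numpy as np

# Toy data, made up for this illustration
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

def loss(w, b):
    return 0.5 * np.sum((w * x + b - y) ** 2)

def dL_dw(w, b):
    # sum_i x_i * (w*x_i + b - y_i), as derived above
    return np.sum(x * (w * x + b - y))

def dL_db(w, b):
    # sum_i (w*x_i + b - y_i), as derived above
    return np.sum(w * x + b - y)

# Central finite differences as a numerical sanity check
w0, b0, h = 0.3, -0.2, 1e-6
fd_w = (loss(w0 + h, b0) - loss(w0 - h, b0)) / (2 * h)
fd_b = (loss(w0, b0 + h) - loss(w0, b0 - h)) / (2 * h)
print(dL_dw(w0, b0), fd_w)  # should agree to several decimal places
print(dL_db(w0, b0), fd_b)
```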
\[\begin{pmatrix} \sum_{i=1}^N x_i^2 & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & N \end{pmatrix} \begin{pmatrix} w \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^N x_i y_i \\ \sum_{i=1}^N y_i \end{pmatrix}\]
Let's reconsider the matrix-vector multiplication to understand the linear system given above. Each entry of the first row of the matrix is multiplied with the corresponding entry of the vector \((w, b)^T\) and the products are added up (and likewise for the second row):
\[\begin{pmatrix}
\sum_{i=1}^N x_i^2 & \sum_{i=1}^N x_i \\
\sum_{i=1}^N x_i & N
\end{pmatrix}
\begin{pmatrix}
w \\ b
\end{pmatrix} =
\begin{pmatrix}
\sum_{i=1}^N wx_i^2 + \sum_{i=1}^N x_i b\\
\sum_{i=1}^N x_i w + Nb
\end{pmatrix} = \begin{pmatrix}
w\sum_{i=1}^N x_i^2+ b\sum_{i=1}^N x_i \\
w\sum_{i=1}^N x_i + Nb
\end{pmatrix}\]
But how do we arrive at the right-hand side
\[\begin{pmatrix}
\sum_{i=1}^N x_i y_i \\
\sum_{i=1}^N y_i
\end{pmatrix}?\]
The first row is exactly the condition \(\frac{\partial}{\partial w}\mathcal{L}(w, b) = 0\) and the second row the condition \(\frac{\partial}{\partial b}\mathcal{L}(w, b) = 0\), with the data-only sums moved to the right-hand side. Together these give the final linear system in 1.3.4.
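As a numerical illustration (again a minimal NumPy sketch with made-up toy data), the \(2 \times 2\) system can be assembled and solved directly:

```python
import numpy as np

# Toy data, made up for this illustration
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
N = len(x)

# Assemble the 2x2 system from 1.3.4 and its right-hand side
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x),      N]])
rhs = np.array([np.sum(x * y), np.sum(y)])

w, b = np.linalg.solve(A, rhs)
print(w, b)
```

Any linear solver works here; `np.linalg.solve` is used only because the matrix is \(2 \times 2\) and, as discussed below, invertible whenever at least two of the \(x_i\) differ.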
This can be further simplified by defining the averages \(\overline{x} := \frac{1}{N} \sum_{i=1}^N x_i\) and \(\overline{y} := \frac{1}{N} \sum_{i=1}^N y_i\). Dividing both equations of the system by \(N\), we obtain
\[\begin{pmatrix} \frac{1}{N}\sum_{i=1}^N x_i^2
& \overline{x} \\
\overline{x} & 1 \end{pmatrix}
\begin{pmatrix}
w \\ b
\end{pmatrix}
= \begin{pmatrix} \frac{1}{N}\sum_{i=1}^N x_i y_i \\
\overline{y}\end{pmatrix}\]
The determinant of the system matrix is
\[\frac{1}{N} \sum_{i=1}^N x_i^2 -
\overline{x}^2 = \frac{1}{N} \sum_{i=1}^N (x_i -
\overline{x})^2,\]
which is non-zero exactly when there are at least two distinct data points \(x_i\). In this case we can explicitly calculate \(w\) and \(b\), and the solution is given by (see the exercise sheet)
\[w = \frac{\sum_{i=1}^N (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^N (x_i - \overline{x})^2}\] \[ b = \overline{y} - w \overline{x}\] Unlike nearest-neighbor interpolation, the linear regression function \(\hat{f}(x) := wx + b\) is continuous and easy to evaluate. Additionally, it is more robust to errors in the data \((x_i, y_i)\).

Remark: Let \[X = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots& \vdots \\ x_N & 1 \end{pmatrix} \in \mathbb{R}^{N \times 2}, \qquad \beta = \begin{pmatrix} w \\ b \end{pmatrix} \in \mathbb{R}^2, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} \in \mathbb{R}^N.\] Then we can rewrite the sum of squared deviations 1.3.3 as \[L(\beta) = \frac{1}{2}|| X \beta - y ||^2,\] where \(|| \cdot ||\) denotes the Euclidean norm on \(\mathbb{R}^N\) (\(||x|| = \sqrt{x_1^2 + x_2^2 + \dots + x_N^2}\)).
Why does this work? Multiplying out \(X\beta\) gives \[X\beta = \begin{pmatrix}x_1w+b \\ x_2w+b \\ \vdots \\ x_N w + b \end{pmatrix},\] so the \(i\)-th entry of \(X\beta - y\) is exactly the residual \(wx_i + b - y_i\). The squared Euclidean norm then takes over the role of the sum over \(i\): \[L(\beta) = \frac{1}{2} \sum_{i=1}^N \big((X\beta-y)_i\big)^2 = \frac{1}{2}||X\beta-y||^2.\]
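To tie the closed-form solution and the matrix formulation together, here is a small sketch (NumPy, made-up toy data) that computes \(w, b\) from the formulas above and checks that the matrix-form loss agrees with the sum of squared residuals:

```python
import numpy as np

# Toy data, made up for this illustration
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
N = len(x)

# Closed-form solution from the formulas above
x_bar, y_bar = x.mean(), y.mean()
w = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - w * x_bar

# Matrix formulation: L(beta) = 1/2 * ||X beta - y||^2
X = np.column_stack([x, np.ones(N)])
beta = np.array([w, b])
loss_matrix = 0.5 * np.linalg.norm(X @ beta - y) ** 2
loss_sum = 0.5 * np.sum((w * x + b - y) ** 2)

print(w, b)
print(loss_matrix, loss_sum)  # the two loss expressions agree
```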
Show that for a model of the form \(\hat{f}(x) = wx\) with parameter \(w \in \mathbb{R}\), the least squares solution is given by \[w = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2}\] Solution: We have \(\mathcal{L}(w) := \frac{1}{2}\sum_{i=1}^N (wx_i - y_i)^2\) and \(\frac{\partial}{\partial w}\mathcal{L} = \sum_{i=1}^N (wx_i - y_i)x_i\). Since we want to minimize the function, we set \(\frac{\partial}{\partial w}\mathcal{L} = 0\): \[\begin{align}\sum_{i=1}^N (wx_i - y_i)x_i &= 0 \\ \sum_{i=1}^N (wx_i^2 - y_ix_i) &= 0 \hspace{0.5cm} | + \sum_{i=1}^N x_iy_i \\ w\sum_{i=1}^N x_i^2 &= \sum_{i=1}^N x_i y_i \hspace{0.5cm} | \div \sum_{i=1}^N x_i^2 \\ w &= \frac{\sum_{i=1}^N x_iy_i}{\sum_{i=1}^N x_i^2} \\ \square \end{align}\]
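A quick numerical check of this result (a minimal NumPy sketch; the data values and the brute-force grid search are made up for the example):

```python
import numpy as np

# Toy data, made up for this illustration (roughly y = 2x)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed-form least-squares solution for the no-intercept model f(x) = w*x
w = np.sum(x * y) / np.sum(x ** 2)

# Brute-force check: minimize the loss over a grid of candidate w values
grid = np.linspace(0.0, 4.0, 4001)
losses = 0.5 * np.sum((grid[:, None] * x - y) ** 2, axis=1)
print(w, grid[np.argmin(losses)])  # both close to 2
```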
Show that the gradient of the function \(L\) in 1.3.3 is given by \[\nabla L(\beta) = X^T(X \beta - y)\] Remember that for any vector \(v\), \[||v||^2 = v^Tv,\] since the squared norm is just the sum of the squared entries (multiply each entry by itself and add everything up). So \[||X \beta - y ||^2 = (X\beta-y)^T (X\beta - y)\] and the loss function becomes \[\mathcal{L}(\beta) = \frac{1}{2} (X\beta-y)^T (X\beta-y).\] Expanding the product, \[(X\beta-y)^T (X\beta-y) = (X\beta)^TX\beta - (X\beta)^T y - y^TX\beta + y^Ty.\] Here \((X\beta)^T y\) is a scalar (a \(1 \times N\) row vector times an \(N\)-vector), and a scalar equals its own transpose (\(a^T = a\)), so \((X\beta)^Ty = y^T(X\beta)\) and the two middle terms combine. Together with \((X\beta)^T = \beta^T X^T\) we can rewrite \[(X\beta-y)^T (X\beta-y) = \beta^TX^T X\beta - 2y^TX\beta + y^Ty.\] Adding the \(\frac{1}{2}\) again, \[\mathcal{L}(\beta) = \frac{1}{2}[\beta^T X^T X \beta - 2y^T X \beta + y^Ty].\] Now for any quadratic form \(\beta^T A \beta\), where \(A\) is a symmetric matrix, the gradient with respect to \(\beta\) is \[\nabla_\beta (\beta^T A \beta) = 2A\beta.\] Since \(X^TX\) is symmetric, applying this formula gives \[\nabla_\beta (\beta^T X^TX\beta) = 2X^TX\beta,\] and with the \(\frac{1}{2}\) in front we get \[\nabla_\beta \left(\tfrac{1}{2}\beta^T X^TX\beta\right) = X^TX\beta.\]
Differentiating the second term, \(-2y^TX\beta\): for a linear term like \(c^T\beta\) the gradient is simply the coefficient vector, \[\nabla_\beta(c^T\beta) = c.\] Here \(y^TX\beta = (X^Ty)^T\beta\), so treating \(X^Ty\) as a constant vector we get \[\nabla_\beta\!\left(\tfrac{1}{2}\cdot(-2y^TX\beta)\right) = -X^Ty.\] The last term \(y^Ty\) does not contain \(\beta\), so its gradient is \(0\). Combining the results we get \[\begin{align}\nabla_\beta\mathcal{L}(\beta) &= X^TX\beta - X^Ty\\
\nabla_\beta\mathcal{L}(\beta) &= X^T(X \beta - y)
\end{align}\]
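To see the gradient formula in action, here is a minimal gradient-descent sketch (NumPy, made-up toy data, arbitrarily chosen step size and iteration count; not the method used above, which solves the system exactly), compared against NumPy's built-in least-squares solver:

```python
import numpy as np

# Toy data, made up for this illustration
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
X = np.column_stack([x, np.ones_like(x)])  # design matrix with a column of ones

beta = np.zeros(2)   # start at w = b = 0
step = 0.05          # step size chosen by hand for this toy data
for _ in range(5000):
    grad = X.T @ (X @ beta - y)  # gradient formula from the derivation above
    beta -= step * grad

# Compare with NumPy's least-squares solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_ls)  # both give (approximately) the same w and b
```

The step size has to be small enough relative to the largest eigenvalue of \(X^TX\) for the iteration to converge; the value above was picked by hand for this toy data.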
Next chapter: Supervised Learning